Predefined Content Identifier Rules
Cyberhaven provides an extensive library of predefined Content Identifier Rules that form the foundation of the content inspection system. These rules are built using the Nucleuz Classification Engine and are designed to detect specific data types, patterns, and sensitive information with high accuracy and performance.
Overview
The Content matching rules page under Preferences provides a unified interface for you to define content inspection rules.
On the Rules tab, you can:
- View all predefined and custom rules
- Create new custom rules
- Delete custom rules
- Enable or disable predefined and custom rules
Note
Predefined rules cannot be deleted.
Rules that are currently applied in a policy or dataset cannot be deleted.
Predefined Rules Library
Cyberhaven includes a comprehensive library of predefined classification rules that detect sensitive data types commonly found in organizations worldwide. These rules are organized into logical groupings and cover various data protection requirements, regulatory compliance needs, and industry-specific data types.
Rule Categories
The predefined rules are organized into several high-level categories:
Personal Identifiers
- Social Security Numbers: Various national formats with validation
- National ID Numbers: Country-specific identification formats
- Passport Numbers: International passport number patterns
- Driver License Numbers: Regional license number formats
- Tax Identification Numbers: Country-specific tax ID patterns
Financial Data
- Credit Card Numbers: Multiple card types with Luhn algorithm validation
- Bank Account Numbers: Various national and international formats
- IBAN Numbers: International Bank Account Number validation
- Routing Numbers: Banking routing and transit numbers
- SWIFT Codes: Bank identifier codes
Healthcare Data
- Medical Record Numbers: Healthcare identifier formats
- Drug Enforcement Administration (DEA) Numbers: US registration numbers for prescribers of controlled substances
- National Provider Identifiers: Healthcare provider IDs
- Health Insurance Numbers: Medical insurance identifiers
- Patient Identifiers: Various healthcare system IDs
Communication Data
- Email Addresses: Various email format patterns
- Phone Numbers: International and domestic phone formats
- IP Addresses: IPv4 and IPv6 address patterns
- URLs and Domains: Web address patterns
- MAC Addresses: Network hardware identifiers
Government and Legal
- Voter Registration Numbers: Electoral system identifiers
- Court Case Numbers: Legal system case identifiers
- License Numbers: Professional and business licenses
- Permit Numbers: Government permit identifiers
- Registration Numbers: Various government registrations
Authentication and Security
- API Keys: Various API key patterns
- Access Tokens: Authentication token formats
- Passwords: Password pattern detection
- Cryptographic Keys: Encryption key patterns
- Certificates: Digital certificate identifiers
Rule Structure and Components
Each predefined rule contains multiple components that work together to accurately detect sensitive data:
Pattern Matching
- Regular Expressions: Sophisticated regex patterns for format detection
- Format Validation: Specific formatting requirements (e.g., XXX-XX-XXXX)
- Length Constraints: Minimum and maximum character limits
- Character Sets: Allowed characters and encoding requirements
Validation Functions
- Checksum Algorithms: Mathematical validation (e.g., Luhn algorithm for credit cards)
- Format Verification: Structural validation of data patterns
- Range Validation: Numeric range checking where applicable
- Cross-Reference Validation: Verification against known valid patterns
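To illustrate how a pattern stage and a validation stage work together, here is a minimal sketch for IPv4 addresses. The regular expression and the function name `find_ipv4` are assumptions made for this example; they are not Cyberhaven's actual rule definitions.

```python
import re

# Pattern stage: four dot-separated groups of 1-3 digits that are not
# part of a longer dotted number (hypothetical pattern for illustration).
IPV4_PATTERN = re.compile(
    r"(?<![\d.])(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})(?!\d|\.\d)"
)

def find_ipv4(text: str) -> list[str]:
    """Return IPv4-looking strings whose octets all pass range validation."""
    hits = []
    for match in IPV4_PATTERN.finditer(text):
        # Validation stage: every octet must fall in the range 0..255.
        if all(0 <= int(octet) <= 255 for octet in match.groups()):
            hits.append(match.group(0))
    return hits
```

The pattern alone would accept `999.1.1.1`. The range check is what rejects it, which is the kind of false positive that validation functions are meant to remove.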
Context Analysis
- Supporting Keywords: Contextual terms that increase confidence
- Proximity Analysis: Related terms within specified distance
- Document Structure: Location-based context (headers, forms, etc.)
- Language Support: Multilingual keyword recognition
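Proximity analysis can be pictured as a keyword search in a window around a pattern match. The sketch below assumes a character-based window and a hypothetical helper named `has_keyword_nearby`; the product may measure distance differently (for example, in words or tokens).

```python
import re

def has_keyword_nearby(text: str, start: int, end: int,
                       keywords: list[str], window: int = 40) -> bool:
    """Return True if any keyword occurs within `window` characters
    before or after the matched span text[start:end]."""
    lo = max(0, start - window)
    hi = min(len(text), end + window)
    # Only the surrounding text is searched, not the match itself.
    context = (text[lo:start] + " " + text[end:hi]).lower()
    return any(
        re.search(r"\b" + re.escape(keyword.lower()) + r"\b", context)
        for keyword in keywords
    )
```

A caller would first locate a candidate with a pattern, then pass the match offsets to this helper to decide whether supporting keywords are present.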
Confidence Scoring
- Base Confidence: Initial confidence based on pattern match
- Context Boost: Additional confidence from supporting evidence
- Validation Confirmation: Confidence increase from successful validation
- Threshold Management: Configurable confidence thresholds
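The four scoring components above can be sketched as a simple additive model. The weights, the 0 to 100 scale, the default threshold, and the names `Evidence`, `confidence`, and `is_match` are all hypothetical; the actual scoring model is not described in this document.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    pattern_matched: bool
    validation_passed: bool
    keyword_nearby: bool

# Hypothetical weights on a 0-100 scale, chosen only for illustration.
BASE_CONFIDENCE = 50
VALIDATION_BOOST = 30
CONTEXT_BOOST = 20

def confidence(evidence: Evidence) -> int:
    """Combine base confidence with validation and context boosts."""
    if not evidence.pattern_matched:
        return 0
    score = BASE_CONFIDENCE
    if evidence.validation_passed:
        score += VALIDATION_BOOST
    if evidence.keyword_nearby:
        score += CONTEXT_BOOST
    return min(score, 100)

def is_match(evidence: Evidence, threshold: int = 75) -> bool:
    """Report a detection only when the score reaches the threshold."""
    return confidence(evidence) >= threshold
```

Under these assumed weights, a pattern match with keyword context but no successful validation scores 70 and stays below the default threshold of 75. Lowering the threshold would report it, at the cost of more false positives.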
Rule Performance Characteristics
Detection Accuracy
- High Precision: Minimized false positives through validation
- Comprehensive Coverage: Multiple patterns for format variations
- Contextual Awareness: Reduced false positives through context analysis
- Adaptive Thresholds: Configurable sensitivity levels
Processing Efficiency
- Optimized Patterns: Regular expressions tuned for performance
- Parallel Processing: Rules designed for concurrent execution
- Memory Efficiency: Optimized memory usage patterns
- Scalable Architecture: Performance maintained at scale
Regional Adaptations
- Localized Patterns: Country-specific data formats
- Language Support: Multilingual keyword recognition
- Cultural Context: Region-appropriate detection patterns
- Regulatory Alignment: Compliance with local data protection laws
Example Rule Types
Social Security Number (US)
- Pattern: XXX-XX-XXXX format detection
- Validation: Area number and group number validation (for example, rejecting area 000 or group 00)
- Context: Keywords like "SSN", "Social Security", "Tax ID"
- Confidence: High confidence with validation, medium without
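As a rough sketch of the pattern and validation steps for US Social Security Numbers, the code below applies the publicly documented structural restrictions: area 000, 666, and 900 to 999 are never assigned, and neither is group 00 or serial 0000. The function name `find_ssns` is an assumption for this example, and the predefined rule may apply additional checks.

```python
import re

# Pattern stage: XXX-XX-XXXX, not embedded in a longer digit run.
SSN_PATTERN = re.compile(r"(?<!\d)(\d{3})-(\d{2})-(\d{4})(?!\d)")

def find_ssns(text: str) -> list[str]:
    """Return XXX-XX-XXXX strings that pass basic structural validation."""
    hits = []
    for match in SSN_PATTERN.finditer(text):
        area, group, serial = match.groups()
        # Area numbers 000, 666, and 900-999 are never assigned.
        if area in ("000", "666") or int(area) >= 900:
            continue
        # Group 00 and serial 0000 are never assigned.
        if group == "00" or serial == "0000":
            continue
        hits.append(match.group(0))
    return hits
```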
Credit Card Numbers
- Pattern: 13-19 digit sequences with optional separators
- Validation: Luhn algorithm checksum verification
- Context: Keywords like "card", "credit", "payment"
- Types: Visa, Mastercard, American Express, Discover, and others
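The Luhn check named above is a standard public algorithm. A minimal sketch follows; the function name `luhn_valid` and the handling of separators are assumptions for this example.

```python
def luhn_valid(number: str) -> bool:
    """Return True if a 13-19 digit number passes the Luhn checksum.
    Spaces and hyphens are treated as optional separators."""
    cleaned = number.replace(" ", "").replace("-", "")
    if not (cleaned.isascii() and cleaned.isdigit()):
        return False
    if not 13 <= len(cleaned) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

For example, the widely used test number `4111 1111 1111 1111` passes, while changing its last digit to 2 makes the check fail.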
Email Addresses
- Pattern: local-part@domain format, following the RFC 5322 address syntax
- Validation: Domain structure and character validation
- Context: Communication-related keywords
- Variations: Multiple format variations and international domains
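A deliberately simplified sketch of email detection is shown below. The pattern covers common addresses only and is not a full RFC-compliant parser; both the pattern and the name `find_emails` are assumptions for this example.

```python
import re

# Simplified pattern: a local part, "@", one or more dot-separated labels,
# and an alphabetic top-level domain of at least two letters.
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}"
)

def find_emails(text: str) -> list[str]:
    """Return substrings that look like email addresses."""
    return [match.group(0) for match in EMAIL_PATTERN.finditer(text)]
```

This sketch does not handle quoted local parts or internationalized domain names, which a production rule would need to consider.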
IBAN Numbers
- Pattern: Country code + check digits + account identifier
- Validation: MOD 97-10 checksum algorithm (ISO 7064)
- Context: Banking and financial keywords
- Coverage: All IBAN-participating countries
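The IBAN checksum is a public standard: move the first four characters to the end, convert letters to numbers (A=10 through Z=35), and check that the resulting integer leaves a remainder of 1 when divided by 97. The sketch below assumes a hypothetical function name `iban_valid` and does not check country-specific lengths, which a complete rule would also verify.

```python
def iban_valid(iban: str) -> bool:
    """Return True if the string passes the IBAN MOD-97 check."""
    s = iban.replace(" ", "").upper()
    # General bounds only: 15 to 34 alphanumeric characters.
    if not (s.isascii() and s.isalnum() and 15 <= len(s) <= 34):
        return False
    # Two-letter country code followed by two check digits.
    if not (s[:2].isalpha() and s[2:4].isdigit()):
        return False
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

For example, the commonly published sample `GB82 WEST 1234 5698 7654 32` passes, while altering its check digits to 83 fails.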
Rule Management
Enabling Rules
Enable the predefined and custom rules you want to use for content inspection. Cyberhaven's content inspection engines will analyze content using the enabled rules to identify sensitive data patterns.
Rule Limitations
Note
You cannot disable rules currently in use within a policy or dataset.
The total number of rules that can be enabled at the same time is limited. The limit depends on system resources and performance requirements.
Selection Guidelines
When selecting predefined rules:
- Data Relevance: Choose rules that match the types of sensitive data in your environment
- Regional Requirements: Select rules appropriate for your geographic regions
- Regulatory Compliance: Include rules required for applicable compliance frameworks
- Performance Impact: Consider the cumulative processing overhead of enabled rules
- Accuracy Requirements: Balance comprehensive coverage with acceptable false positive rates
Policy Association
Predefined rules are used within Content Identifier Policies to:
- Define Detection Scope: Specify which data types to detect
- Set Confidence Thresholds: Configure sensitivity levels
- Combine Multiple Rules: Create comprehensive detection policies
- Enable Contextual Detection: Leverage supporting evidence
Performance Considerations
Resource Usage
- CPU Impact: Processing overhead varies by rule complexity
- Memory Requirements: Rules consume system memory during execution
- I/O Considerations: Content scanning affects storage and network performance
- Scalability: Performance impact scales with content volume and rule count
Optimization Strategies
- Selective Enablement: Enable only necessary rules for your environment
- Threshold Tuning: Adjust confidence thresholds to balance accuracy and performance
- Rule Prioritization: Focus on high-value data types first
- Performance Monitoring: Track system performance with different rule configurations